{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "702be7fc",
   "metadata": {},
   "source": [
    "## Pip installations\n",
    "These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that pins the right release.\n",
    "\n",
    "##### Example for transform developers working from a git clone:\n",
    "\n",
    "make venv \n",
    "\n",
    "source venv/bin/activate \n",
    "\n",
    "pip install jupyterlab\n",
    "\n",
    "venv/bin/jupyter lab\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63e0484e",
   "metadata": {},
   "source": [
    "# Prerequisites\n",
    "\n",
    "***Rust*** must be installed locally to run this transform. For installation instructions, see https://www.rust-lang.org/tools/install\n",
    "\n",
    "### Add Rust to $PATH\n",
    "If Rust is **not** on your `$PATH`, run the cell below to append the directory containing `cargo` to it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be41d57b-28a1-4a35-b180-1fe6d9f1a3ad",
   "metadata": {},
   "outputs": [],
   "source": [
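    "import os\n",
    "import pathlib\n",
    "\n",
    "# `whereis cargo` prints 'cargo: /path/to/cargo ...'; nothing follows the colon if cargo is not found.\n",
    "result = !whereis cargo\n",
    "locations = result[0].split()[1:] if result else []\n",
    "if locations:\n",
    "    # Append the directory that contains the cargo binary to PATH.\n",
    "    cargo_dir = str(pathlib.Path(locations[0]).parent)\n",
    "    os.environ['PATH'] += os.pathsep + cargo_dir\n",
    "else:\n",
    "    print('cargo not found; install Rust or add its bin directory to PATH manually')"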
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c8a4f8d",
   "metadata": {},
   "source": [
    "## Import required classes and modules\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be791e3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_rep_removal.runtime import RepRemoval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56d78079",
   "metadata": {},
   "source": [
    "## Command Line Parameters\n",
    "For the full list of command line parameters, see the [README](./README.md#input-parameters).\n",
    "\n",
    "In this notebook, all default values are used, except:\n",
    "\n",
    "| Parameter                          | Used here                          | Description                                       |\n",
    "|------------------------------------|------------------------------------|---------------------------------------------------|\n",
    "| `rep_removal_contents_column_name` | `text`                             | Name of the column holding the document contents  |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d065b6c3",
   "metadata": {},
   "source": [
    "## Set up the runtime parameters and invoke the transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d6a5f8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "RepRemoval(\n",
    "    input_folder=\"test-data/input\",\n",
    "    output_folder=\"test-data/output\",\n",
    "    rep_removal_contents_column_name=\"text\",\n",
    ").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ce81bef",
   "metadata": {},
   "source": [
    "\n",
    "### The specified `output_folder` will contain the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c96f697-b957-43ee-8244-8a19da75b721",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "glob.glob(\"test-data/output/*\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dpk_outer",
   "language": "python",
   "name": "dpk_outer"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
